This week's lab provides a practical introduction to visual data analysis. At the end of the lab, you should be able to use pandas to:
As you will see later, pandas provides a lot of different options for plotting data and we only have time to look at a few in detail. If you're interested, you can find more information on plotting here.
This week, we'll be using data from the Iris flower data set, a well-known data set in the world of data analysis. The data set consists of 150 samples taken from three different species of Iris (Iris setosa, Iris virginica and Iris versicolor; 50 samples each). In each case, the length and width (in centimeters) of the Iris' sepals and petals were measured. The purpose of the lab is to perform some exploratory visual analysis on the data to see what we can learn.
Let's start by making sure that plots are displayed inline by issuing the magic command %matplotlib inline and importing pandas in the usual way:
In [ ]:
%matplotlib inline
import pandas as pd
Next, let's load the data. Write the path to your iris.csv file in the cell below:
In [ ]:
path_to_csv = "data/iris.csv"
If you open the CSV file manually (e.g. using spreadsheet software), you will see that it contains six columns: four describe the measurements of the data (i.e. the sepal and petal widths and lengths), while the remaining two describe the species of the data (i.e. Iris setosa, Iris virginica or Iris versicolor) and the index number of the sample, which ranges from 1 to 50 for each species.
If we index the measurement data using both the species and sample number information, we can refer to individual measurements per species. Fortunately, with pandas, this is easily done using the same read_csv command we used last week. The only difference this time is that we must pass a list of column names (in order) to use as the index.
Note: Last week, we passed index_col='year', so that we could index the baby name information in the data frame by year. This week, we'll use the same index_col argument, but instead pass a list of strings to use as indices, i.e. index_col=['index_1', 'index_2', ..., 'index_n'].
Execute the cell below to load the data into a pandas data frame and index that data frame by the species and sample_number columns:
In [ ]:
df = pd.read_csv(path_to_csv, index_col=['species', 'sample_number'])
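To see what a list-valued index_col does without needing the real file, here is a minimal sketch that reads a tiny in-memory CSV. The measurement column names (sepal_length, petal_length) and the values are illustrative assumptions, not taken from iris.csv.

```python
import io
import pandas as pd

# A tiny stand-in for iris.csv; column names and values are hypothetical.
csv_text = """species,sample_number,sepal_length,petal_length
setosa,1,5.0,1.4
setosa,2,4.8,1.5
virginica,1,6.3,6.0
virginica,2,5.9,5.1
"""

# Passing a list to index_col builds a two-level (hierarchical) index.
toy = pd.read_csv(io.StringIO(csv_text), index_col=['species', 'sample_number'])

print(toy.index.names)             # the names of the two index levels
print(toy.loc[('virginica', 2)])   # one sample, addressed by species and number
```

Each row is now identified by a (species, sample_number) pair rather than a single label.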
Let's take a closer look at the data we've just loaded. We can start by using the head method to take a peek at the first few rows:
In [ ]:
df.head()
As you can see, we now have two index columns on the left side of the data frame: species and sample_number.
Let's pick up where we left off in the last lab and compute some summary statistics. We can do this the same way we did last week, using the describe method, like this:
In [ ]:
df.describe()
This is good, but what if we want to see data relating to a specific species? This is where our index helps. Because we indexed the data frame using both species and sample numbers, we can reference all the data for a single species by indexing into the data frame using just that species name. For instance, to compute summary statistics for Iris setosa alone, we can write:
In [ ]:
df.loc['setosa'].describe()
As you can see, the summary for Iris setosa counts just fifty values, rather than the 150 values in the summary of the entire data frame.
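The effect of indexing by species name can be sketched on a small hand-made frame. The data below is hypothetical (two species, three samples each, made-up measurements); only the index layout mirrors the lab's data frame.

```python
import pandas as pd

# Hypothetical toy data with the same two index levels as the lab's data frame.
index = pd.MultiIndex.from_product(
    [['setosa', 'virginica'], [1, 2, 3]],
    names=['species', 'sample_number'],
)
toy = pd.DataFrame(
    {'sepal_length': [5.0, 4.8, 5.2, 6.3, 5.9, 6.5],
     'petal_length': [1.4, 1.5, 1.3, 6.0, 5.1, 5.8]},
    index=index,
)

# Selecting on the first index level returns only that species' rows,
# leaving sample_number as the remaining index.
setosa = toy.loc['setosa']
print(setosa.index.name)
print(setosa.describe().loc['count'])
```

Because the species level is consumed by the lookup, the summary counts only the rows belonging to that species.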
What about comparing statistics across different species? Again, with pandas, this is easy. Last week, we saw that pandas has several methods for computing statistics for data frames, e.g. mean, median, min, max and std. Each of these methods accepts an optional level argument, which refers to the level of the index you want to compute the function on. For instance, to compute the mean of each sepal and petal length and width for each species, we can write:
In [ ]:
df.mean(level='species')
Here, we've set level='species' because we have two index columns: species and sample_number. If the level argument is not specified, calling mean on a data frame has the effect of computing the mean of each column across all indices:
In [ ]:
df.mean()
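Note: To the best of our knowledge, the level argument of mean, median, std and similar methods was deprecated in pandas 1.3 and removed in pandas 2.0, so the per-species cell above may raise a TypeError on a recent installation. A hedged equivalent is to group by the index level first. The sketch below uses hypothetical toy data rather than the Iris file.

```python
import pandas as pd

# Hypothetical toy data (not the real Iris measurements).
index = pd.MultiIndex.from_product(
    [['setosa', 'virginica'], [1, 2, 3]],
    names=['species', 'sample_number'],
)
toy = pd.DataFrame(
    {'sepal_length': [5.0, 4.8, 5.2, 6.3, 5.9, 6.5],
     'petal_length': [1.4, 1.5, 1.3, 6.0, 5.1, 5.8]},
    index=index,
)

# Group rows by the 'species' index level, then average each column per group.
per_species = toy.groupby(level='species').mean()
print(per_species)

# Without grouping, mean() averages each column over every row.
print(toy.mean())
```

The same pattern should work for the other statistics, e.g. toy.groupby(level='species').median().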
So why bother learning methods like mean, median and std when we can just call describe to get a complete summary? One reason is that the describe method does not accept a level argument, so calls to it will compute statistics for all of the data in the data frame, unless we index into the data frame first with a specific index reference (like we did earlier when computing a summary for Iris setosa).
Another reason is that calling these specific instance methods allows us to make further calls to pandas' built-in visualisation methods. Let's take a look at some of these in more detail next.
Manually examining tables of numerical values can be an exhausting experience. For instance, when we computed the mean sepal and petal widths and lengths for each species earlier, we ended up with a table of numbers, which doesn't provide much immediate insight into how these quantities vary over species without us having to manually compare the contents of each cell. In cases like this, using a visual technique can be a much better option because it gives us an immediate intuitive sense of what's going on. Let's take a look at a few commonly used techniques next.
Most of pandas' plotting functionality is contained in the plot method of the data frame, which is itself a wrapper for the matplotlib plotting library. To create a bar chart of some data, all we need to do is call the plot method on it with the optional argument kind='bar' to specify that we want a bar chart.
Note: If kind is not specified, then plot defaults to a line plot.
For instance, to create a bar chart of the mean value of each column in our data frame, all we need to do is call mean on the data frame and then call plot on the output of our call to mean (remembering to set kind='bar' too!), like this:
In [ ]:
df.mean().plot(kind='bar');
But what if we want to visualise how these quantities vary across species? Easy! All we need to do is pass level='species' to the mean method, like earlier:
In [ ]:
df.mean(level='species').plot(kind='bar');
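As a self-contained sketch of a per-species bar chart, the cell below builds hypothetical toy data, averages it per species with groupby (an alternative to the level argument, which recent pandas releases may not accept) and plots the result. The Agg backend line is only an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Hypothetical toy data (not the real Iris measurements).
index = pd.MultiIndex.from_product(
    [['setosa', 'virginica'], [1, 2, 3]],
    names=['species', 'sample_number'],
)
toy = pd.DataFrame(
    {'sepal_length': [5.0, 4.8, 5.2, 6.3, 5.9, 6.5],
     'petal_length': [1.4, 1.5, 1.3, 6.0, 5.1, 5.8]},
    index=index,
)

means = toy.groupby(level='species').mean()
ax = means.plot(kind='bar')   # one group of bars per species, one bar per column

print(len(ax.patches))        # number of bars drawn
```

Each row of the table becomes a group on the x axis and each column becomes a coloured bar within that group.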
We can also produce stacked bar charts, by passing the optional argument stacked=True, like in the cell below:
Note: By default, if it is not specified, stacked=False.
In [ ]:
df.mean(level=0).plot(kind='bar', stacked=True);
We can also make the bar chart horizontal rather than vertical by setting kind='barh', like this:
In [ ]:
df.mean(level=0).plot(kind='barh');
Note: pandas provides lots of options for plotting. You can find a comprehensive list (with sample code) here.
We now have a visual representation of the mean values of each sepal and petal measurement across each species. We don't need to worry about creating a chart legend, choosing bar colours or labelling our $x$ axis - pandas (via matplotlib) does this all automatically.
At the start of this lab, the Iris data set was described as containing 150 samples in total, consisting of 50 from each species. Up until now, we haven't really questioned whether this is true, although as good data analysts we should have! Let's rectify our oversight now.
As it turns out, counting things in pandas is a pretty trivial task. All we need to do is call the count method of the data frame and we get a summary of the number of data points in each column:
Note: If our columns have missing data (the Iris data set doesn't), then count will return the number of non-missing (i.e. valid) data points. We'll see this in a bit more detail next week.
In [ ]:
df.count()
Like the mean method, count accepts a level argument, so we can tell pandas to count the number of non-missing data points for each species like this:
In [ ]:
df.count(level='species')
It's a trivial example because each of the species has an equal number of samples, but why don't we use a pie chart to visualise this data?
Note: In many cases, sample categories are not evenly distributed, so examining the proportion of samples in each category is a valuable habit to form.
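To illustrate how counting treats missing values, here is a minimal sketch on hypothetical toy data with one measurement deliberately set to NaN. It groups by the species index level before counting, which should behave like the level argument on pandas versions that no longer accept it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one sepal_length value is deliberately missing.
index = pd.MultiIndex.from_product(
    [['setosa', 'virginica'], [1, 2, 3]],
    names=['species', 'sample_number'],
)
toy = pd.DataFrame(
    {'sepal_length': [5.0, 4.8, np.nan, 6.3, 5.9, 6.5],
     'petal_length': [1.4, 1.5, 1.3, 6.0, 5.1, 5.8]},
    index=index,
)

# count() ignores NaN, so the affected species/column pair is one short.
counts = toy.groupby(level='species').count()
print(counts)
```

A count that differs between columns of the same species is a quick hint that some values are missing.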
Creating pie charts with pandas works in a similar way to creating bar charts: we call plot on the data we want to visualise, passing kind='pie' instead of kind='bar' to specify the output chart type. There is one small difference though: pie charts cannot represent more than one column of data at a time, so we must tell pandas to make individual plots for each column by passing the optional argument subplots=True, as in the cell below:
Note: While pie charts are a good technique for visualising proportions, they are a poor technique for visualising almost anything else!
In [ ]:
df.count(level='species').plot(kind='pie', subplots=True);
The pie charts look a bit squashed, right? We can fix this by passing the optional argument figsize=(width, height), which adjusts the size of the chart output in inches. For instance, to set the figure size to 16x4, we can write:
In [ ]:
df.count(level='species').plot(kind='pie', subplots=True, figsize=(16,4));
Note: Setting the figure size like this works for any call to pandas' plot method. Try adjusting the size of one of the bar charts we made earlier for practice!
This is better, but the species labels in the chart are overlapping with the column names and the legend, which makes them harder to read. We can fix this by adjusting the rotation of the charts with the startangle argument, as follows:
In [ ]:
df.count(level='species').plot(kind='pie', subplots=True, figsize=(16,4), startangle=90);
We can tidy the output further by reorganising the plots from a 4x1 grid into a 2x2 grid using the optional layout argument. To do this, all we need to do is pass layout=(2,2) to the plot command (also setting figsize=(12,12) to compensate for the change in size), like this:
In [ ]:
df.count(level=0).plot(kind='pie', subplots=True, figsize=(12,12), startangle=90, layout=(2,2));
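The combination of subplots, layout and figsize can be checked on a small example. The sketch below uses hypothetical toy data with four measurement columns (the column names are assumptions) and inspects the grid of axes that plot returns. The Agg backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Hypothetical toy data: three species, two samples each, made-up values.
index = pd.MultiIndex.from_product(
    [['setosa', 'versicolor', 'virginica'], [1, 2]],
    names=['species', 'sample_number'],
)
toy = pd.DataFrame(
    {'sepal_length': [5.0, 4.8, 6.4, 6.9, 6.3, 5.9],
     'sepal_width': [3.4, 3.0, 3.2, 3.1, 3.3, 3.0],
     'petal_length': [1.4, 1.5, 4.5, 4.9, 6.0, 5.1],
     'petal_width': [0.2, 0.2, 1.5, 1.5, 2.5, 1.8]},
    index=index,
)

counts = toy.groupby(level='species').count()

# One pie per column, arranged in 2 rows and 2 columns.
axes = counts.plot(kind='pie', subplots=True, layout=(2, 2),
                   figsize=(12, 12), startangle=90)

print(axes.shape)                            # rows x columns of the grid
print(axes[0, 0].figure.get_size_inches())   # figure width and height in inches
```

Note that layout is given as (rows, columns), while figsize is given as (width, height).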
In lectures, we've seen how histograms can be used to visualise the distribution of the data in a sample. Again, using pandas, this is easily done using the plot method. The only difference this time is that we must set kind='hist'. For instance, to plot histograms for each column in the data frame, we can write:
In [ ]:
df.plot(kind='hist');
If we want to plot data for just one species of Iris, then we need to index into the data frame first (like we did with the describe method earlier), like this:
In [ ]:
df.loc['setosa'].plot(kind='hist');
Like with pie charts, we can force pandas to create a plot for each variable type by specifying the options subplots=True and layout=(2,2) (the figsize option is also used to make sure the plots are big enough to see), like this:
In [ ]:
df.loc['setosa'].plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6));
Next, we can increase the number of bins to get some finer detail about the distributions using the bins argument, like this:
In [ ]:
df.loc['setosa'].plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6), bins=30);
Try varying the number of bins to see the effect it has on the distributions. Remember, if the number of bins is too large, you'll start to get a "broken comb" look.
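The "broken comb" effect can be made concrete by counting empty bins. The sketch below is an illustration using numpy's histogram function on hypothetical measurements recorded to one decimal place. It is not the code pandas runs internally, although the binning idea is similar.

```python
import numpy as np

# Hypothetical measurements recorded to one decimal place (not the real data).
values = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,
                   1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5])

empty_bins = {}
for bins in (5, 30):
    # Split the range of the data into equal-width bins and count values per bin.
    counts, edges = np.histogram(values, bins=bins)
    empty_bins[bins] = int((counts == 0).sum())
    print(bins, 'bins ->', empty_bins[bins], 'empty')
```

With only a handful of distinct values, most of the 30 narrow bins cannot contain anything, which is what produces the gaps between the bars.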
Creating histograms for a single species is informative, but we're missing out on the bigger picture. One of the most important aspects of exploratory data analysis is determining whether there are any relationships in your data and one of the best techniques for visualising this is the scatter plot matrix.
Before we get to building a matrix for our data, let's consider the quantitative approach first: computing correlations for each pair of variables in our data set. With pandas, this is easy - all we need to do is call the corr method on our data frame, like in the cell below, and pandas computes the Pearson correlation coefficient for each pair of variables in our data set and presents the data as a table.
In [ ]:
df.corr()
As you can see from the table, petal length is highly positively correlated with sepal length.
Note: By definition, a data sample is always completely positively correlated to itself (i.e. $r_{xx} = 1$). This is why the diagonal entries in the table above are all equal to one.
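To connect the table to the underlying formula, the sketch below computes the Pearson coefficient by hand for one pair of columns and compares it with the value from corr. The data is hypothetical, and the column names are assumptions chosen to resemble the lab's.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real Iris measurements).
toy = pd.DataFrame({
    'petal_length': [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

x = toy['petal_length'].to_numpy()
y = toy['petal_width'].to_numpy()

# Pearson r: covariance of x and y divided by the product of their spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy.corr().loc['petal_length', 'petal_width']
print(r_manual, r_pandas)
```

Substituting the same column for both x and y makes the numerator equal to the denominator, which is why the diagonal of the table is one.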
Next, let's consider the qualitative approach: in pandas, we can make a scatter plot matrix from a data frame by passing it to pandas' scatter_matrix function, like this:
Note: Unlike the other methods we've covered today, scatter_matrix is not an option we can pass to plot, i.e. we cannot call df.plot(kind='scatter_matrix'). Instead, we must always pass the data frame to the function as in the cell below.
In [ ]:
pd.plotting.scatter_matrix(df, figsize=(12, 12));
This is good, but the points in our scatter plot are a little small. We can change this by specifying the optional s argument (s stands for size), like this:
In [ ]:
pd.plotting.scatter_matrix(df, figsize=(12, 12), s=100);
This is much clearer! We can now easily spot correlations visually, e.g. between petal width and petal length, which will help us form hypotheses about what the final outcome of our analysis might be. For example, in this case, we might hypothesise that flowers with wider petals also tend to have longer petals.
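As a self-contained sketch of what scatter_matrix produces, the cell below runs it on a small hypothetical data frame with three columns and inspects the grid of axes it returns. The Agg backend line is only an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Hypothetical toy data (not the real Iris measurements).
toy = pd.DataFrame({
    'sepal_length': [5.0, 4.8, 6.4, 6.9, 6.3, 5.9],
    'petal_length': [1.4, 1.5, 4.5, 4.9, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.5, 1.5, 2.5, 1.8],
})

# One panel per pair of columns; s sets the marker size of the scatter points.
axes = pd.plotting.scatter_matrix(toy, figsize=(6, 6), s=100)

print(axes.shape)   # one row and one column of panels per variable
```

The off-diagonal panels are scatter plots of one column against another, while the diagonal panels show each column's own distribution.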